{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Phrase Embedding\n",
    "\n",
    "Phrase embedding的实现，基本上和word2vec没啥差别。\n",
    "\n",
    "主要步骤如下：\n",
    "\n",
    "* 抽取出Phrase，然后在语料中的合适位置插入这些phrase\n",
    "* 使用word2vec网络，基于插入phrase的语料，训练模型\n",
    "\n",
    "注意：\n",
    "> embedding层词典的数量，需要把phrase计算在内\n",
    "\n",
    "\n",
    "phrase的抽取，可以使用PMI（Pointwise Mutual Information），详情参考：[word2vec-for-phrases-learning-embeddings-for-more-than-one-word](https://towardsdatascience.com/word2vec-for-phrases-learning-embeddings-for-more-than-one-word-727b6cf723cf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 插入phrase\n",
    "\n",
    "通过上述方法，已经准备好了phrase，假设文件是**phrase.txt**。接下来就是把这些phrase插入到语料中。\n",
    "\n",
    "方法如下：\n",
    "\n",
    "* 使用jieba对句子分词\n",
    "* 选定一个窗口，对在这个窗口的词语，两两组合成phrase，如果phrase在文件phrase.txt中，则在该窗口中插入该phrase\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting jieba\n",
      "  Using cached https://files.pythonhosted.org/packages/71/46/c6f9179f73b818d5827202ad1c4a94e371a29473b7f043b736b4dab6b8cd/jieba-0.39.zip\n",
      "Building wheels for collected packages: jieba\n",
      "  Building wheel for jieba (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Stored in directory: /opt/userhome/kdd_luozhouyang/.cache/pip/wheels/c9/c7/63/a9ec0322ccc7c365fd51e475942a82395807186e94f0522243\n",
      "Successfully built jieba\n",
      "Installing collected packages: jieba\n",
      "Successfully installed jieba-0.39\n"
     ]
    }
   ],
   "source": [
    "!pip install jieba"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def build_phrase_set(phrase_file):\n",
    "    phrases = set()\n",
    "    if not os.path.exists(phrase_file):\n",
    "        print('file does not exist: %s' % phrase_file)\n",
    "        return phrases\n",
    "    with open(phrase_file, mode='rt', encoding='utf8', buffering=8192) as fin:\n",
    "        for line in fin:\n",
    "            line = line.strip('\\n')\n",
    "            if not line:\n",
    "                continue\n",
    "            phrases.add(line)\n",
    "    return phrases\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1.5年', '1.6亿', '1.8亿', '1.responsible', '1.严格', '1.严格执行', '1.主导', '1.主持', '1.主要', '1.了解']\n"
     ]
    }
   ],
   "source": [
    "phrase_file = '/opt/algo_nfs/kdd_luozhouyang/tmp/phrase_v1.txt'\n",
    "phrase = build_phrase_set(phrase_file)\n",
    "print(sorted(list(phrase))[100:110])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "class SequenceProcessor:\n",
    "    \n",
    "    def __init__(self, phrase_file, window_size=5):\n",
    "        self.phrases = build_phrase_set(phrase_file)\n",
    "        self.window_size = window_size\n",
    "        \n",
    "    def process(self, input_file, output_file):\n",
    "        if len(self.phrases) == 0:\n",
    "            print('phrase set is empty.')\n",
    "            return\n",
    "        with open(output_file, mode='wt', encoding='utf8', buffering=8192) as fout, \\\n",
    "            open(input_file, mode='rt', encoding='utf8', buffering=8192) as fin:\n",
    "            for line in fin:\n",
    "                line = line.strip('\\n')\n",
    "                if not line:\n",
    "                    continue\n",
    "                res = self.process_line(line)\n",
    "                fout.write(res + '\\n')\n",
    "                \n",
    "    def process_line(self, line):\n",
    "        words = line.split(' ')\n",
    "        index = 0\n",
    "        left = 0\n",
    "        right = left + self.window_size\n",
    "        new_words = []\n",
    "        while True:\n",
    "            if right >= len(words):\n",
    "                new_words.append(self.process_window(words[left:]))\n",
    "                break\n",
    "            left = index\n",
    "            right = min(len(words), index + self.window_size)\n",
    "            selected = words[left:right]\n",
    "            new_words.append(self.process_window(selected))\n",
    "            index += self.window_size\n",
    "        return ' '.join(new_words)\n",
    "                \n",
    "    def process_window(self, words):\n",
    "        res = []\n",
    "        concated = []\n",
    "        for i in range(len(words) - 1):\n",
    "            for j in range(i + 1, len(words)):\n",
    "                c = words[i] + words[j]\n",
    "                if c in self.phrases:\n",
    "                    concated.append(c)\n",
    "        \n",
    "        # 交错穿插phrase到words中间。也可以随机打乱插入\n",
    "        long_list = words if len(words) >= len(concated) else concated\n",
    "        short_list = words if len(words) < len(concated) else concated\n",
    "        \n",
    "        i = 0\n",
    "        while i<len(short_list):\n",
    "            res.append(long_list[i])\n",
    "            res.append(short_list[i])\n",
    "            i+=1\n",
    "        while i<len(long_list):\n",
    "            res.append(long_list[i])\n",
    "            i+=1\n",
    "        return ' '.join(res)\n",
    "                    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.严格 1.严格执行 执行 1.主导 条款 1. 主导 支持\n"
     ]
    }
   ],
   "source": [
    "p = SequenceProcessor(phrase_file)\n",
    "words = ['1.严格', '执行', '条款', '1.', '主导', '支持']\n",
    "print(p.process_window(words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.严格 1.严格执行 执行 1.主导 条款 1. 主导 支持 支持\n"
     ]
    }
   ],
   "source": [
    "line = '1.严格 执行 条款 1. 主导 支持'\n",
    "print(p.process_line(line))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## word2vec\n",
    "\n",
    "准备好语料之后，就是word2vec的步骤了：\n",
    "* 构建数据输入管道\n",
    "* 构建网络，训练模型\n",
    "\n",
    "\n",
    "注意：\n",
    "\n",
    "> 建议使用现成的库来训练word2vec，例如gensim\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建数据输入管道\n",
    "\n",
    "使用tf.data API构建数据输入管道。可以考虑负采样。\n",
    "\n",
    "如果使用keras内置的skipgram函数构建训练数据，请参考：[A word2vec keras tutorial](https://adventuresinmachinelearning.com/word2vec-keras-tutorial/)，它支持负采样。\n",
    "\n",
    "如果准备使用tf.data API构建数据输入管道，同时又需要进行负采样，则可以按照以下步骤：\n",
    "* 使用skipgram方式准备训练数据，文件格式为：'center context label'\n",
    "* 写程序对上述的训练数据文件进行负采样，生成新的负采样训练数据文件，格式和之前保持一致\n",
    "* 使用tf.data API通过上述两种文件构建数据输入管道，记得一定要shuffle,并且buffer_size要设置的足够大\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建网络\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "\n",
    "class Word2Vec(tf.keras.Model):\n",
    "    \n",
    "    def __init__(self, vocab_size, embedding_size):\n",
    "        super(Word2Vec, self).__init__(name='word2vec')\n",
    "        self.vocab_size = vocab_size\n",
    "        self.embedding_size = embedding_size\n",
    "        \n",
    "        self.reshape = tf.keras.layers.Reshape((self.embedding_size, 1), name='reshape')\n",
    "        self.embedding = tf.keras.layers.Embedding(self.vocab_size, self.embedding_size, name='embedding')\n",
    "        self.cosine = tf.keras.layers.Dot(axes=1, normalize=True, name='cosine')\n",
    "        self.out = tf.keras.layers.Dense(1, activation='sigmoid', name='out')\n",
    "        \n",
    "    def call(self, inputs, training=True, mask=None):\n",
    "        target, context = inputs\n",
    "        target = self.reshape(target)\n",
    "        context = self.reshape(context)\n",
    "        target_embedding = self.embedding(target)\n",
    "        context_embedding = self.embedding(context)\n",
    "        cos = self.cosine([target, context])\n",
    "        out = self.out(cos)\n",
    "        res = {\n",
    "            'out': out,\n",
    "            'cos': cos\n",
    "        }\n",
    "        return res\n",
    "    \n",
    "\n",
    "def build_word2vec_model(vocab_size, embedding_size):\n",
    "    target_input = tf.keras.layers.Input(shape=(1,), name='target_input')\n",
    "    context_input = tf.keras.layers.Input(shape=(1,), name='context_input')\n",
    "    \n",
    "    embedding = tf.keras.layers.Embedding(vocab_size, embedding_size, name='embedding')\n",
    "    target = embedding(target_input)\n",
    "    context = embedding(context_input)\n",
    "    \n",
    "    reshape = tf.keras.layers.Reshape((embedding_size, 1))\n",
    "    target = reshape(target)\n",
    "    context = reshape(context)\n",
    "    \n",
    "    cosine = tf.keras.layers.Dot(axes=1, normalize=True, name='cos')([target, context])\n",
    "    cosine = tf.keras.layers.Reshape((1,))(cosine)\n",
    "    out = tf.keras.layers.Dense(1, activation='sigmoid', name='out')(cosine)\n",
    "    \n",
    "    model = tf.keras.Model(inputs=[target_input, context_input], outputs=[cosine, out])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-26-4502d624fd4c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mw2v\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mWord2Vec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m128\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mw2v\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mloss\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m'output_1'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'binary_crossentropy'\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptimizer\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'sgd'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mw2v\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msummary\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/.conda/envs/machine-learning-notes/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py\u001b[0m in \u001b[0;36msummary\u001b[0;34m(self, line_length, positions, print_fn)\u001b[0m\n\u001b[1;32m   1575\u001b[0m     \"\"\"\n\u001b[1;32m   1576\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuilt\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1577\u001b[0;31m       raise ValueError('This model has not yet been built. '\n\u001b[0m\u001b[1;32m   1578\u001b[0m                        \u001b[0;34m'Build the model first by calling `build()` or calling '\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1579\u001b[0m                        \u001b[0;34m'`fit()` with some data, or specify '\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build."
     ]
    }
   ],
   "source": [
    "w2v = Word2Vec(100, 128)\n",
    "w2v.compile(loss={'output_1': 'binary_crossentropy'}, optimizer='sgd')\n",
    "w2v.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "W0507 16:54:31.956479 140213465438016 training_utils.py:1152] Output reshape_12 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to reshape_12.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"model_7\"\n",
      "__________________________________________________________________________________________________\n",
      "Layer (type)                    Output Shape         Param #     Connected to                     \n",
      "==================================================================================================\n",
      "target_input (InputLayer)       [(None, 1)]          0                                            \n",
      "__________________________________________________________________________________________________\n",
      "context_input (InputLayer)      [(None, 1)]          0                                            \n",
      "__________________________________________________________________________________________________\n",
      "embedding (Embedding)           (None, 1, 128)       12800       target_input[0][0]               \n",
      "                                                                 context_input[0][0]              \n",
      "__________________________________________________________________________________________________\n",
      "reshape_11 (Reshape)            (None, 128, 1)       0           embedding[0][0]                  \n",
      "                                                                 embedding[1][0]                  \n",
      "__________________________________________________________________________________________________\n",
      "cos (Dot)                       (None, 1, 1)         0           reshape_11[0][0]                 \n",
      "                                                                 reshape_11[1][0]                 \n",
      "__________________________________________________________________________________________________\n",
      "reshape_12 (Reshape)            (None, 1)            0           cos[0][0]                        \n",
      "__________________________________________________________________________________________________\n",
      "out (Dense)                     (None, 1)            2           reshape_12[0][0]                 \n",
      "==================================================================================================\n",
      "Total params: 12,802\n",
      "Trainable params: 12,802\n",
      "Non-trainable params: 0\n",
      "__________________________________________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "w2v_model = build_word2vec_model(100, 128)\n",
    "w2v_model.summary()\n",
    "w2v_model.compile(loss={'out': 'binary_crossentropy'}, optimizer='sgd')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### gensim训练word2vec\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -q gensim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "W0507 17:52:54.546020 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.580649 140211332073216 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.615082 140211332073216 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.649325 140211332073216 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.683369 140211332073216 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.716284 140211332073216 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.749374 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 17:52:54.779373 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([ 0.34674248,  0.35570675, -0.1457441 ,  0.00879948,  0.55500114,\n",
       "        0.38890362,  0.37437907, -0.27349955, -0.33025077,  0.3912286 ,\n",
       "        0.1590332 , -0.1091135 ,  0.10804082, -0.41607952,  0.11582852,\n",
       "        0.02014425, -0.3567299 ,  0.13947669,  0.7742888 , -0.41307807,\n",
       "       -0.36171225, -0.13599542, -0.04251283, -0.19304901, -0.26372176,\n",
       "       -0.3122111 , -0.31131655,  0.04328417,  0.6876864 ,  0.86348176,\n",
       "        0.44242084, -0.4835791 , -0.27528867,  0.34724486,  0.26832047,\n",
       "        0.17167363,  0.00260511, -0.69957006,  0.24595639,  0.7650556 ,\n",
       "       -0.74492747, -0.41633615,  0.19623893,  0.18933427,  0.42557034,\n",
       "       -0.07534644, -0.27418914,  0.28443357, -0.07384911, -0.2399095 ,\n",
       "        0.08272726, -0.28894734, -0.56103647, -0.01218327,  0.46845567,\n",
       "        0.43867397,  0.02944496, -0.02013955, -0.43548632, -0.4004143 ,\n",
       "       -0.15792081,  0.2640149 , -0.00460577,  0.44588038, -0.20286655,\n",
       "       -0.16475451,  0.10544273,  0.47026154, -0.38691202, -0.32563987,\n",
       "        0.18999799,  0.1412622 , -0.28497273,  0.26131025, -0.55693984,\n",
       "        0.236121  , -0.6224107 , -0.6297036 ,  0.33352986, -0.07437032,\n",
       "        0.13476905, -0.66396487, -0.20812301,  0.45992637,  0.2952906 ,\n",
       "        0.14182511,  0.12882109, -0.2410688 ,  0.31629702,  0.17738922,\n",
       "        0.19243604,  0.01158332,  0.33083722,  0.75173104,  0.65473104,\n",
       "       -0.2639982 , -0.28223032, -0.30263805, -0.2172858 , -0.01222184],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import logging\n",
    "\n",
    "import gensim\n",
    "from gensim.models import Word2Vec\n",
    "\n",
    "logging.basicConfig(filename=\"/opt/algo_nfs/kdd_luozhouyang/tmp/w2v.log.20190507\", level=logging.INFO)\n",
    "\n",
    "# 如果有多个文件呢？请看下面的MultiFileSentence\n",
    "sentence = gensim.models.word2vec.LineSentence('/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023')\n",
    "\n",
    "# 还有哪些参数，该怎么设置？请看文档 \n",
    "# https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/models/word2vec/index.html#models.word2vec.Word2Vec\n",
    "model = Word2Vec(sentence, window=5, negative=5)\n",
    "\n",
    "model.save('/opt/algo_nfs/kdd_luozhouyang/models/gensim/w2v.20190507.model')\n",
    "model.wv.save_word2vec_format('/opt/algo_nfs/kdd_luozhouyang/models/gensim/w2v.20190507.vec')\n",
    "model.wv['开发']\n",
    "\n",
    "\n",
    "# with open('w2v.vec', mode='rt', encoding='utf8', buffering=8192) as fin:\n",
    "#     count = 0\n",
    "#     for line in fin:\n",
    "#         print(line)\n",
    "#         count += 1\n",
    "#         if count == 10:\n",
    "#             break\n",
    "            "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "class MultiFileSentence:\n",
    "    \n",
    "    def __init__(self, files):\n",
    "        self.files = files\n",
    "        \n",
    "    def __iter__(self):\n",
    "        for f in self.files:\n",
    "            with open(f, mode='rt', encoding='utf8', buffering=8192) as fin:\n",
    "                for line in fin:\n",
    "                    line = line.strip('\\n')\n",
    "                    if not line:\n",
    "                        continue\n",
    "                    words = line.split(' ')\n",
    "                    yield words\n",
    "                    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "W0507 19:09:24.579389 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n",
      "W0507 19:09:24.600231 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n"
     ]
    }
   ],
   "source": [
    "from gensim.models import Word2Vec\n",
    "\n",
    "files = ['/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023',\n",
    "        '/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023',]\n",
    "multi_file_sentence = MultiFileSentence(files)\n",
    "\n",
    "model = Word2Vec(multi_file_sentence, window=5, negative=5)\n",
    "model.save('/tmp/w2v.model')\n",
    "model.wv.save_word2vec_format('/tmp/w2v.vec')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "W0507 19:09:56.170234 140213465438016 smart_open_lib.py:385] this function is deprecated, use smart_open.open instead\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([-0.23979124, -0.35029036, -0.48673213,  0.11066395,  0.03688607,\n",
       "        0.4783164 ,  0.10762557, -0.55959433, -0.18692617, -0.16329728,\n",
       "        0.04612793, -0.18539032,  0.02333255, -0.50036645, -0.6414493 ,\n",
       "        0.36254427, -0.16179731,  0.01201913,  0.3043091 , -0.0179352 ,\n",
       "       -0.06606959, -0.28152755, -0.04192889,  0.03955165, -0.15794286,\n",
       "       -0.35448447,  0.08880788,  0.00767798,  0.19260862,  0.91442215,\n",
       "        0.48151648,  0.04959353, -0.42420253,  0.5900174 ,  0.3191452 ,\n",
       "        0.51084846, -0.05812714, -0.28591454,  0.15560418,  0.8640183 ,\n",
       "       -0.92507184, -0.6447743 ,  0.66764784, -0.0323838 ,  0.62571084,\n",
       "        0.10482378, -0.08711988,  0.16586128, -0.19405377,  0.16375169,\n",
       "       -0.28106654, -0.41007572, -0.11500295, -0.08608052, -0.19349962,\n",
       "        0.1785441 , -0.12186533,  0.08123282, -0.19011433, -0.36758485,\n",
       "       -0.02843239,  0.18410258,  0.29441646, -0.06380797,  0.3583782 ,\n",
       "       -0.49278077,  0.46070313,  0.2910697 , -0.40502158, -0.33644053,\n",
       "        0.05132522,  0.5443117 , -0.02138861, -0.36348036, -0.03147215,\n",
       "        0.01462314, -0.35532072, -0.5962006 ,  0.20956516, -0.16379467,\n",
       "       -0.11297157, -0.5136015 , -0.3343722 ,  0.52123594,  0.19512372,\n",
       "       -0.15842146,  0.22164236, -0.3669537 ,  0.1446174 , -0.19961458,\n",
       "        0.74838805,  0.07226578,  0.3681845 ,  0.26203114,  0.5108071 ,\n",
       "        0.27643618, -0.2104923 , -0.04684884, -0.14765073, -0.1739889 ],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = Word2Vec.load('/tmp/w2v.model')\n",
    "m.wv['开发']"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine-learning-notes] *",
   "language": "python",
   "name": "conda-env-machine-learning-notes-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
